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For the Bernoulli Matching model of sequence alignment problem we apply the Bethe ansatz tech- 
nique via an exact mapping to the 5-vertex model on a square lattice. Considering the terrace-like 
representation of the sequence alignment problem, we reproduce by the Bethe ansatz the results for 
the averaged length of the Longest Common Subsequence in Bernoulli approximation. In addition, 
we compute the average number of nucleation centers of the terraces. 
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I. INTRODUCTION: BERNOULLI MATCHING MODEL OF SEQUENCE ALIGNMENT 

The goal of a sequence alignment problem is to find similarities in patterns in different sequences. Sequence 
alignment is one of the most useful quantitative methods of evolutionary molecular biology [l|, H, [1]. A classic 
alignment problem deals with the search of the Longest Common Subsequence (LCS) in two random sequences. 
Finding analytically the statistics of LCS of a pair of sequences randomly drawn from the alphabet of c letters is a 
challenging problem in computational evolutionary biology. The exact asymptotic results for the distribution of LCS 
have been derived recently in in a simpler, yet nontrivial, variant called the Bernoulli Matching (BM) model (see 
details below) . It has been shown in ^ via a sequence of mappings that in the BM model, for all c, the distribution 
of the asymptotic length of the LCS, suitably scaled, is identical to the Tracy-Widom distribution of the largest 
eigenvalue of a random matrix whose entries are drawn from a Gaussian Unitary Ensemble (CUE) Q . 

The problem of finding the longest common subsequence in a pair of sequences drawn from alphabet of c letters is 
explicitly formulated as follows. Consider two sequences a = {ai, a2, • . . , Ui} (of length i) and /3 = /32, ■ ■ ■ , Pj} 
(of length j). For example, a and /3 can be two random strings of c = 4 base pairs A, C, G, T of a DNA molecule, e.g., 
a = {A, C, G, C, T, A, C} with i = 6 and /3 = {C, T, G, A, C} with j = 5. Any subsequence of a (or (3) is an ordered 
subhst of a (or (3), i.e. subsequences which need not be consecutive. For example, {C,G,T, C} is a subsequence 
of a, but {T, G, C} not. A common subsequence of two sequences a and /3 is a subsequence of both of them. For 
example, the subsequence {C, G, A, C} is a common subsequence of both a and p. There are many possible common 
subsequences of a pair of sequences. The aim of the LCS problem is to find the longest of them. This problem and 
its variants have been widely studied in biology 0, H, EO) [HI , computer science 0, El, [H, , probability theory 
I20I [21I and more recently in statistical physics [H, [2I] . A particularly important application of the 
LCS problem is to quantify the closeness between two DNA sequences. In evolutionary biology, the genes responsible 
for building specific proteins evolve with time and by finding the LCS of the 'same' gene in different species, one can 
learn what has been conserved in time. Also, when a new DNA molecule is sequenced in vitro, it is important to 
know whether it is really new or it already exists. This is achieved quantitatively by measuring the LCS of the new 
molecule with another existing already in the database. 

Computationally, the easiest way to determine the length Lij of the LCS of two arbitrary sequences of lengths i 
and j (in polynomial time ~ 0{ij)) with no cost of gaps can be achieved using the simple recursive algorithm |3. [23j 



Lij = max 



(1) 



subject to the initial conditions Li o — ip.j — -^0,0 — 0, where the variable rjij is: 




1 if characters at the positions i (in a) and j (in /3) match each other, 
otherwise 



(2) 
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In FiglT^ the matrix of the variables r]i,j is shown for a particular pair of sequences a = {A, C, G, C, T, A, C} and 
f3 — {C,T, G,A, C} discussed above. Any common subsequence can be represented by a directed path connecting 
sequentially {Vii.ji, 1112^2^ ■■■^Vujsj --^Vi^jA^ where iji^j^ = rji^j^ = ... = rj^^j^ = ... = Vi^.j^ = 1 and i^+i > is, 
js+i > is for all 1 < s < n. Two particular realizations of common subsequences, namely {C,G,A, C}, and {A, C} 
are shown in Figllji by two broken lines connecting '1' (the first one is the one of LCSs). In Fig[T)3 we show the table 
of all Li^j (0 < * < 6, < J < 5) corresponding to the matrix 77^.^ (the first line and the first column are the boundary 
conditions Lqj = Lj^ = 0). Let us note the terrace-like structure of Fig[T]3: the numbers from to 4 could be viewed 
as different 'heights' of the terraces. We shall address later to this representation. 



sequence a i / 







A 


C 




c 


T 


A 


C 






























C 




® 




1 






1 












1 


1 


1 


1 


1 


1 


seq 


T 










1 
















1 


1 


1 


2 


2 


2 


uen 


























1 


2 


2 


2 


2 


2 




CD 

■CD 


A 


1 



















1 


1 


2 


2 


2 


3 


3 




C 




1 




1 













1 


2 


2 


3 


3 


3 


4 



(a) (b) 



FIG. 1: (a) Matrix of variables r/ij; (b) Table of lengths Li^j of all LCSs corresponding to (a). 



For a pair of fixed sequences of lengths i and j respectively, the length Lij of their LCS is just a number. However, 
in the statistical version of the LCS problem one compares two random sequences drawn from the alphabet of c letters 
and hence the length L^., is a random variable. The statistics of Lij has been intensively studied during the past 
three decades (isllisl. [iTlilSi. ili^. For equally long sequences (i — j ~ n), it has been proved that (L„,„) ~ 7cn for 
n ^ 1, where the averaging is performed over all uniformly distributed random sequences. The constant 7^ is known 
as the Chvatal-Sankoff constant which, to date, remains undetermined though there exists several bounds [3, [l^ [l^ , 
a conjecture due to Steele [13] that 7c = 2/(1 + y^) and a recent proof [20( that jc 2/-yc as c — *■ 00. Unfortunately, 
no exact results are available for the finite size corrections to the leading behavior of the average {Ln,n), for the 
variance, and also for the full probability distribution of Thus, despite tremendous analytical and numerical 

efforts, exact solution of the random LCS problem is far from being completely resolved. One feature that makes this 
problem particularly complicated is that the variables rji^ that are defined in ([2|) are not mutually independent but 
are correlated. To see that consider the simple example - matching of two strings a = AB and /3 — AA. One has by 
definition: 771^1 = 771,2 — 1 and 772,1 = 0. The knowledge of these three variables is sufficient to predict that the last 
two letters do not match each other, i.e., 772,2 = 0. Thus, 772,2 can not take its value independently of 771,1, 771,2, 772,1. 
Note however that for two random sequences drawn from the alphabet of c letters, the correlations between the 77^,^- 
variables vanish in the c — s- cx) limit. 

A first natural question is whether the problem is solvable in the absence of correlations between the rjijsl This 
question leads to the Bernoulli Matching (BM) model which is a simpler variant of the original LCS problem where 
one ignores the correlations between 77i,j's for all c [235 . The length Lf^^ of the BM model satisfies the same recursion 
relation as in Eq.([T]) except that 77i,j's are now independent and each rji^j is drawn from the bimodal distribution: 

This approximation is expected to be exact only in the c — + 00 limit. Nevertheless, for finite c, the results on the BM 
model can serve as a useful benchmark for the original LCS model to decide if indeed the correlations between rjij^s 
are important or not. Progress has been made for the BM model which, though simpler than the original LCS model, 
is still nontrivial. The average matching length (L^^) in the BM model, for large sequence lengths, n, was first 
computed by Seppalainen [2l[ using probabilistic method and it was shown that {Lf^n) ~ 7c '^'^n for n ^ 1 where 
Ic^ = 2/(1 + Y^), same as the Steele's conjectured value, 7c, for the original LCS model. Later the same result was 
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rederived in the physics hterature using the cavity method of the spin glass physics. Recently, in two of us 



derived the asymptotic limit law for the distribution of the random variable and showed that for large n 



L 



BM 



(4) 



where 7^ ~ 2/(1 + ^/c) and x a random variable with a n-independent distribution, Prob(x < x) — Ft^t^{x) 
which is the well studied Tracy- Widom distribution for the largest eigenvalue of a random matrix with entries drawn 
from a Gaussian unitary ensemble Q . For a detailed form of the function Ftw {x) , see Q . We also have shown that 
for all c. 



/(c) 



cl/6(^- 1)1/3 



(5) 



This allowed us to calculate in [j| the average length L^, 
as the variance of i^*^ for large n. 



including the subleading finite size correction term, as well 



BM\ 



Var(Lf,f 



ix) I{c)n^" 
{{x^)-{xf) f{c)n 



(6) 



where we have used the known exact values ^,{x) = -1.7711 ... and (x^) -^M1= 0.8132 .... The recursion relation 
([T]) can also be viewed as a (1 + l)-dimensional directed polymer problem [22, l23| and some asymptotic results (such 
as the 0(n^/^) behavior of the variance of Ln,™ for large n) can be obtained using the arguments of universality [23 |. 
However the limiting Tracy- Widom distribution and the associated exact scale factors in Eqs. (4-6) derived in [4| can 
not be obtained simply from the universality arguments. 

As it has been mentioned above, the level structure depicted in FiglT]3 can be viewed as 3-dimensional terraces. 
Namely, let us add one extra 'height' dimension to the system of level lines in Fig[T)D. Each time when we cross the 
level line constructed via the recursion algorithm ([l]), we increase the height by one. Hence, the length Lf^^ can 
be interpreted as the height of a surface above the 2D (i, j) plane. Considering the 2D projection of the level lines 
separating the adjacent terraces in FiglTJa, we can note that the rule H]) prohibits the overlap of these level lines, i.e., 
different level lines cannot have common segments or edges — see FigI2^,b. The resulting 3-dimensional system of 
terraces shown in Figl2}D is in one-to-one correspondence with the 2-dimensional system of level lines in Fig[2^. Note 
that a similar (but not identical) model with terrace structures and the associated level lines also appeared in an 
anisotropic 3D directed percolation model (2^ 1 , which in turn is also related to the directed polymer problem studied 
by Johansson [2^. However, the levels lines in the directed percolation model can overlap. In contrast, the level lines 
in our model do not overlap. These two models are nevertheless related by a nonlinear transformation [1, [l^l- The 
connections among all mentioned and some other related models is briefly discussed in the conclusion. 



00 

CD 

00 

CN 



1 2 3 4 5 6 

i 

(a) 




(b) 



FIG. 2; Bernoulli Matching model: (a) Level lines. This is just Fig[T]D rotated by 90°. Note that adjacent level lines have no 
common segments or edges; (b) 3-D 'terrace-like' representation of the heights. 
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All previously derived results (|4])-([6]) deal with the height statistics in the 3-dimensional terrace representation of 
Bernoulli Matching model. In terms of the level lines in the projected 2D plane (see Figl^^), the height Lf^' is just 
the total number of level lines within the rectangle of side lengths i and j. Note that any configuration of the level 
lines in the 2D plane has an associated statistical weight (see later for details). These weights are such that one can 
interpret a projected 2D configuration of the BM model as a 5-vertex model. This 5-vertex model is an interesting 
model in its own right (apart from its connection to the BM model) and it is natural to study the statistical properties 
of various objects associated with this two dimensional 5-vertex model. One such quantity is the total number of level 
lines inside the rectangle with sides i and j that translates into the height in the BM model. Similarly, there are other 
random variables such as the total number of corners and the total number of horizontal segments in the rectangle 
of sides that have not been studied before, and that have nontrivial and interesting statistical properties. For 
example, the 'left' corners shown by the big dotted points in FiglS^ are the so called 'nucleation' centers for the 
terraces and play an important role in the mapping between the BM model and the so called 'Longest Increasing 
Subsequence' (LIS) problem [1]. In the hmit of large number of letters, c (c ^ oo), one can show that these nucleation 
centers are Poisson distributed in the 2D plane with a uniform density pc = 1/c However, for finite c, the statistics 
of the number of nucleation centers is, to our knowledge, still unknown. In particular, one would expect that even 
the average density Pc{x, y) of the nucleation centers for finite c is nonuniform in the [x, y) plane and has a nontrivial 
form that reduces to the uniform value pc ^ 1/c in the c — s- oo limit. In the present paper, we shall calculate explicitly 
the average density pc{x,y) of the nucleation centers for finite c using the Bethe ansatz technique. 

The outline of this paper is as follows. In section[TTl we explain the precise mapping between the Bernoulli Matching 
model and the five vertex model and we recall the associated Bethe equations. We then solve the Bethe equations in 
cylindrical geometry. This allows us in section IIIII to calculate the mean flux of the word lines in the 5-vertex model 
as well as the associated large deviation function. Reverting to the Bernoulli Matching model, our calculation leads 
to an independent derivation of the expectation of the Longuest Common Subsequence. In section IIVI we determine 
the full statistics of the number of 'left' corners. In terms of the terrace model these left corners play the role of 
nucleation centers: our approach allows us to calculate the mean number of such centers. Concluding remarks and a 
flowchart of various models related to the Bernoulli Matching model are given in section |Vl 



II. BERNOULLI MATCHING AS A 5-VERTEX MODEL: BETHE EQUATIONS 



The statistical weight of a projected 2D configuration of lines of Bernoulli Matching model depicted in Figl5] is 
the product of weights associated with the vertices on the 2D plane. Let W{C) be the total weight of a full 2D 
configuration of the system of lines in Fig[2^. Note that a new terrace is nucleated with probability p ^ 1/c and 
hence we associate a weight p with each nucleation center. These nucleation centers are the 'left' corners of the lines 
shown by the big dotted points in Fig[2^. On the other hand, each empty vertex in Figl2^ has a weight q = 1 — p 
(corresponding to a mismatch in the BM model). The other filled vertices that are not 'left' corners have an associated 
weight of 1 each. Thus, we may write the total weight of a configuration C of vertices in Fig[2^ as 

WiC) = p'^- q^" with p^-, (7) 

c 

where Nc and are the numbers of 'left' corners (shown by big dotted points in Fig[2^) and empty vertices respec- 
tively. 

One can view the configuration of trajectories in Figl2^ as an assembly of world lines of hard-core particles moving 
on a one-dimensional lattice. In this alternative representation the horizontal {x) and the vertical (y) axes in Fig[3^ 
are correspondingly the 'time' and the 'space' directions of the ID lattice gas picture. The dynamics of these hard- 
core particles is as follows. Particles are moving along the lines as indicated in FigO and each particle tries to jump 
to any of the empty slots available to it before the location of the next particle to its right. Let P{s\m) denote the 
probability that a particle hops s steps, given that the next particle to its right is at a distance m. Thus there are 
(to — 1) holes between the two particles and therefore s = 0, 1, 2, . . . , (to — 1). The eq.([7]) for the total weight W{C) 
of the configuration dictates the following choice of this hopping probability: 

P{s\m) = p^-^'^>q^-^-^ . (8) 

One can easily check that the probability P{s\m) is properly normalized: P[k\m) = X)!^^ P{k\m) = 1. 

There are five types of possible vertices with nonzero weights as shown in FigjHJs. Since the level lines never cross 
each other, the weight is always zero. As it is seen from definition of vertices in Figl5]3, we draw a solid line for the 
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arrow ^ or J,. Otherwise we leave a bond empty. These rules slightly differ from the standard notations in Baxter's 
book [2^ , but correspond to the notations used in the paper [l^l where the Bethe ansatz equation has been considered 
for the 5-vertex modeL Suppose now that the weights {wi, ...jWe} are as follows: 

LJi = 1 ; W2 = 1 ; (^3^0; Ldi = q=l-p; u^^p; c^^e = 1 • (9) 

Let us note that in aU Bethe equations below the corner weights lo^ and cjg always enter in the combination uj^loq. 
Hence only the product weight loc^ojq — p is properly defined. Henceforth wc shall call the 'left' corners (shown by 
the big dotted points in Fig. (3a)) as the 'corners' on which we will mainly focus. We are not interested here in the 
'right' corners that correspond to the termination of the horizontal part of each world line. 




° 1 2 3 4 5 6 
(x), space 



(a) (b) 

FIG. 3: (a) System of 2D level lines separating adjacent terraces and the left corners depicted by big dots, (b) the weights of 
the associated vertices. Note that the labels on the axes of Fig. (3a) correspond to the labels in Fig[2]D, but the 'time' actually 
runs vertically downwards. The labels (the numbers) on the vertical axis do not correspond to time. 



We first consider our vertex model in a cylindrical geometry with N sites in each row and we assume there are 
infinite number of rows. As usual, the number m of 'up arrows' (t) in each row is conserved. Because of the one-to-one 
correspondence of 'up arrows' and 'solid' vertical bonds in a row, m is the number of solid vertical bonds and N m 
is the number of empty vertical bonds in a row. 



The grand canonical partition function, Z{p), of a 5-vertex model reads 

Z(p) = ^<"c.^<ic.5C.6)^=. 



(10) 



where N^, Ny, N^., Nc are correspondingly the numbers of: horizontal and vertical bonds, empty vertices and corners 
(by corners we mean only 'left' corners shown by the big dots in Fig. (3a)) for any particular configuration of the 
lines (or equivalently of the arrows). The summation in pop runs over all available configurations of these world lines, 
satisfying the non-crossing and non-overlapping constraints. 



The standard Bethe equation for the roots Zj of the 5-vertex model is as follows (see [23| for details): 



,N 



(-1) 



N-m- 



N-m 

■nr 



Az,- 



(j = l,2,...,iV-m) 



(11) 



where 



The highest eigenvalue, A 

m 

n 



can be written in two equivalent forms: 



A, 



.N- 



i = l 



bJiZj 



■ W4 



.AT- 



n 



LUlZ-j 



, ,m N-m 



N-m 

n 



a'2W4 



(12) 



(13) 
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The second expression will be more convenient for our further computations. Given the weights ([9]), we find 

A=i— ^ = 1. (14) 

Hence we obtain the following Bethe equations 

N-m 

(-1)"'-™"^ n (15) 

r 



N 

And Am is given by Eq. lflB]) : 



A, 



i-9rnT^+'?''n?^= Vi^^^vz,). (16) 



j=l J ^ j = l 3 j=l 



with Zj (j = 1, ...,N — m) defined by p5)) . 

Equations (fTS]) and are almost identical to the Bethe equation for roots and to the expression for the highest 
eigenvalue of the transfer matrix of the totally asymmetric exclusion process (TASEP) [2811. In the context of the 
exclusion process, these equations have been studied by many authors [1^ [s^, HH, |3l|, [s^] (for a recent review see 



III. ANALYSIS OF THE BETHE EQUATIONS 

A. Statistics of the world lines: the flux 

The averaged 'flux', $, in the system of world lines shown in FigO^ is equal to the typical length (normalized per 
N) of a horizontal segment, {Nh), between left and right corners averaged over all configurations. Hence, to study 
the statistics of we add an extra weight to each horizontal bond and each essential corner. We thus define the 
partition function Z(p,^) with the following collection of weights (compare to ([9])): 

= e'^ ; ^2 = 1 ; ^3 = ; W4 = g = 1 - p ; Wscjg = pe^ ■ (17) 

The mean value {Nh) can then be computed in a standard way 



j. _ {^h) ^ conf = — — 1 Zi ) 

N Nj^i^-pf-ien'^HpeT^ ^ m=o 

conf 

The fluctuations of the average flux can be derived in the similar way: 

The particular choice of weights ^T7]i leads to the following expression for A defined in JT 



(18) 



(19) 



A = "^"^ ' "'^"^ ^ , (20) 
while the general form of Bethe equation (|lip remains unchanged. The highest eigenvalue A„ reads now 

Km = u^r^f n 1 + ^^j- ) = n (1 ~ + p'^''^) ■ (21) 
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We proceed further using the technique of analyzing Bethe equations proposed in [3l|. Making the change of 
variables 



Hj = Zj A — 1 — Zj ~ 1 , 
we can seek the solution of Bcthe equation pT|) rewritten for i/j : 



in the form 



Qi/Q^2-^ij/Q(^l 



]N/Q 



(22) 



(23) 



(24) 



where _B is a constant that will be determined self-consistently, and Q — N — m. We arrive finally at the system of 
equations for the parametric determination of the partition function Z{p,fi) = (A^)^ in the thermodynamic limit 

N ^ oo: 



1 ^ 
— lnZ(p,^) = ^ln(l+p?/j) 



(25) 



Equations (I24p - ||25p can be rewritten in a closed form using the standard residue formula — see, for example [34 

i 

In Z(n. ii} = (T) 

N 



^lnZ(p,.) ^ i/Eln(l+P.)(l-^^) 



dy 



3 = i 

Q 



Qy + l) y- B^IQe^-^^i/Q{l + y)^/^ 



(26) 



Solving (|26p we express hiZ{p,fi) as a series expansion 



N y 



dy 



y + lj y- B^/Qe^'^'^/Q (1 + y)^/Q 



,T{Nk,Qk) 



k=l 



fc (A^fc-1)! 
{Qm)\{{N - Q)k)\ 



(27) 



where 



Qk-l 



xQfc— r?i' — 1 



(28) 



For the computation of the expectation and the variance of the flux it is sufficient to cut the series ([28|) at the 
second term. 



1 In Z[p, fi) = pBT[N, Q) + ^pB^T{2N, 2Q) 
' Q\[N-Qyr (2Q)!(2(iV-Q))!' 



(29a) 
(29b) 



where 



T{N,Q) 



N \ I p f N 



I J 1 + ^p p + pq\Q-l 



p ( 2N 



/ 2N 

[2Q-lJl + i^p ~ p + pq\2Q-l 



(30) 
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and we have defined the density p (of world lines) in (l30l) in the limit N oo &s 

m _ N-Q 

Now we can extract the constant B from the (|29b|l and substitute it into (|29ap . After some algebra and using 
Stirling formula, we arrive at the expression for the free energy (mean value and fluctuations) of the system in the 
thermodynamic limit — s- oo in the ensemble with fixed density p of world (level) lines (i.e. with fixed total number 
of world lines, m = pN) and fugacity, /i, of horizontal bonds (including corners): 

N N P + qp 4: p + qp p^/^ 

Differentiating Eq. ((32l) with respect to /i and putting fi = 0, we get the expectation of the flux $ in the system: 

^ = P^tlPl, (33) 
p + qp 

This expression thus reproduces the result obtained in [l^] by a different method. According to the variance 
Var($) reads 

Var($).^^(i^A-V.. (34) 
4 p + qp p^'^ 

More generally, eliminating B between the two equations of ([77]) . allows us to determine the moments of ^ to any 
desired order. 



B. Expected Length of the Longest Common Subsequence in Bernoulli Matching model 



Let us return to the system of world lines shown in Fig[2^. Our aim in this subsection is to derive a macroscopic 
coarse-grained equation of state for the average height L{x, y) in the BM model which is precisely the average length 
of the longest match. To derive this, we return to the system of world lines shown in Fig[3^ then closely follow a 
similar line of arguments that were used in Let us first define dxL{x, y) = ^^^^'^^ and dyL{x, y) = These 
are random variables and we would like to first estimate their average values. First consider crossing the world lines 
by going horizontally along the x direction. The variable dxL{x, y) takes the values 



dxL(x, y) 



if we do not cross a level line 

1 if we cross a level line 



Thus the average value of dxL{x, y) is the number of lines encountered per unit distance along the horizontal {x) 
direction. In the ID-lattice gas language this is just the density p of particles. Hence, our first relation is: 

dxL{x,y) ^ p. (35) 

Let us now try to relate the second object dyL{x,y) also to the density p. For this, let us imagine crossing the 
world lines along the vertical (y) axis. It is clear that one can write. 



^ dx dx 
dy dy 



dyL{x, y) = dxL{x, y)— = p— (36) 



and then replace ^ by its coarse-grained value, namely, 

dx («,.) 



ay - iv, <=•■' 
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where Ni is the total number of world lines or equivalently the total number of particles in the 1-d lattice with N 
sites. The rhs of Eq. ([57)1 is just the mean length of the total horizontal segment in each row per world line and hence 
is also the mean horizontal distance travelled by each particle in unit time (note that one unit of time corresponds 
to traversing one row in the vertical direction) and hence it represents the macroscopic value of ^ on the Ihs of Eq. 
([37|) . Now, writing Ni = pN, we immediately get from Eqs. (f37|) and ([36|) 



where we have used the expression of $ from Eq. 

Note that the density p in both Eqs. (|35p and (|38p is an unknown macroscopic variable, which however can be 
eliminated from the two equations. This then gives the desired equation for the average surface height: 

p{l- d^L{x, y) - dyL{x, y)) ^ q dxL{x, y)dyL{x, y) . (39) 

Solving ([59]) , we find the average profile of the surface shown in Figl^b 

2^/pxy — p(x + y) , ^ 

L{x, y) = ^ . (40) 

<? 

Note that this result is vahd only in the regime px < y < x/p. This is because at y = x/p, L{x,y) = x and hence 
p ~ 1 already achieves its maximum value. Hence, in the regime y > x/p, L(x,y) sticks to its value L{x,y) = x. 
Symmetry arg uments show that for y < px, L{x, y) sticks to the value L{x, y) = y. We note that this result was first 
derived in [21| using probabilistic methods. Putting a; = y = n in (|40|) and recalling that p = 1/c, we arrive at the 
expression 

2 

Hri,n) = — — j=n, (41) 
1 + Vc 

which coincides with the first term of Eq.® for the expectation (L„,„). 



IV. STATISTICS OF CORNERS 



In this Section, we compute the mean number of corners which play the role of nucleation centers in the terrace 
model. As mentioned above, the expression (|40p is valid in the angle px < y < x/p, while outside this region the 
average surface height is linearly increasing along x for y > x/p and along y for y < px. Hence, the complete expression 
for L{x,y) is as follows: 



L{x,y) 



( 1 

9 



Ipxy — p(x 

X 

y 



y)j if px < y < x/p 

\iy>x/p (42) 
'\i y <px 

The function L{x,y) is depicted in Figl4]for two different values oi p. 

Since the function L(x, y) is symmetric with respect to axis y = x, we can consider the sector x < y only and extend 
the obtained results to the sector x > y afterwards. The local density of lines, p{x, y), is defined in the following way 



= 1 



P{x,y) = 



dL{x,y) 
dx 

dUx,y) ^ VP f 



dx 



if X, y belong to the sector I, y > ^ 



Vp I if a;, ?/ belong to the sector II, a; < y < 



(43) 



Note that in all the results that we derived in the 5-vertex model in a cylindrical geometry we have assumed a 
constant density p of world lines using it as a given fixed parameter of the model. Now, as seen above in Eq. (|43p . 
for the BM model in 2-D, the density of lines p{x,y) is not a constant, but is a function of the space. To use the 



10 



y 
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1 





1 2 3 4 5 

X 




1 2 3 4 5 

X 



(a) (b) 

FIG. 4: Averaged surface height L{x,y) given by (|42p for two different values: (a) p — 0.5 and (b) p — 0.1. 



5-vertex results in calculating the mean corner density in the BM model, we will use a 'coarse-grained' description in 
the following sense. We consider the BM model on a very big lattice. Now, we consider a part of these world lines 
over a sufficiently big 'coarse-grained' region around the point {x,y). In this 'local' region, we will consider the line 
density p{x,y) to be sufficiently slowly varying function and use it as a constant input in the corresponding 5-vertex 
model to calculate the 'local' corner density in the region around the point {x,y). 

For each value of the fixed line density, p{x,y), we need to evaluate the corner density, pc{x,y) from the 5-vertex 
model. The vertex weights in this case are as follows (compare to ([9|) and (fT7|l ): 

wi = 1; W2 = l; t^3 = 0; uj4=^q=l-p; ujr,ue = pe" . (44) 

The mean value (Nc) can be computed as follows 



— -—\nZ(p,a) 

J2{l^pr^ipe'^r^ da 



d 



(45) 



Q=0 



conf 



The fluctuations of the average number of corners can be derived in the similar way as the variance of the flux. Thus, 
the expression for the variance Var(iVc) is as follows 



Var(7V,) = {Nl) - {N,f - ^\nZ{p,a) 



a=0 



The highest eigenvalue now reads: 

JV-rn , \QQ 



with 



= A - 1 = 



1 and A = 



UJ1LO2 — uj^LiJf, 1 — pe° 



1 — pe" q 
Solving Bethe equations, we get the parametric system of equations defining the free energy (compare to (j27p ): 



(46) 



(47) 



(48) 



— lnZ(p,a) = -Qln hpe 

k=l 



N 



k=l 



k {Nk- 1)1 
{Qm)\{{N - Q)k)\ 



(49) 
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with T{Nk, Qk) as in ([28)1 . Proceeding as in Section Hill we arrive at the following expression for the free energy of 
the system in the ensemble with fixed density, p, and fugacity of corners, a: 



^ ^ 7 ( \ - ^ ^ 7( \ p(l-p)(l-pe°) 
— lnZc(p,p,a) = — \nZ{p,a) = - , TTT^TT 



N 



pe- (1 - pf/' ^,^3/2 



where 



pe" + (1 -pe")/9 ' 2 pe" + (1 - pe«)/9 pi/2 
1 -pe" 



77 = In ■ 



9 



Using ((45|) . ((50)) - (|50)) we derive the expression for the corner density 

(iVc> 1 ainZc(p,p,a) 



Pc = 



da 

After simple computations, we obtain for pc the following equation 

p{l - p)p 



a=0 



(50) 



(51) 



(52) 



(53) 



p + qp 

where for p we should understand the local line density p{x,y) given by (|43| . Substituting (|43)) into (|53)) we arrive at 







Pc(x,?/) 



for w > - 



for X < y < 



(54) 



Since the function pc{x,y) is symmetric with respect to y = a; axis, we can straightforwardly reconstruct the values 
of Pc{x, y) in the regions III and IV from (I54p . 



The average number, (A^c), of corners in the square box U, of size L x L can be obtained by integrating the corner 
density, pdx, y) in U, where denotes the region II and III: 



L X L 

(Nc) = I pc{x,y)dxdy = 2 dx pc{x,y)dy = 2 dx 



px 



px 



3q^ 



(55) 

where q = 1 ~ p. It is easily seen that at p ^ the averaged number of corners, (Nc) tends to the value pL^ that 
corresponds to the mean of the Poisson distributed corners or nucleation centers. 

We conclude this section with the following remark. The definition of the density of level lines, p, deserves special 
attention. We distinguish the 'global', pgi, and 'local', pioc, densities, defined respectively as follows: 



m _ L{x,y) 



Ploc 



dL{x,y) 



dx 



2Vp 

x=y=N 1 + Vp 



(56a) 
(56b) 



Note that Eqs. ([33|) -([35 l) are consistent with the density p — pioc defined in (|56bp . 



V. CONCLUSION 



To summarize, in this paper we have shown how to use the Bethe ansatz technique to compute the average number 
of lines and corners in a 2D 5-vertex model that originated from the Bernoulli Matching model of the alignment of 
two random sequences. The asymptotic result for the average number of lines in a square of size (n x n), which in the 
Bernoulli Matching model is the average length {L^^^) of the longest match between two random sequences each of 
length n, was already known by several alternate methods. These include a purely probabilistic method [2l'|, the cavity 
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method of the spin glass physics [1^ , via mapping to a lattice gas of interacting particle systems [23| and also via a 
mapping Q to the Johansson's directed polymer problem [25j . Here, we have provided yet another method namely 
the standard Bethe ansatz technique to compute this asymptotic result. Moreover, we were also able to compute, by 
this method, the average density of corners in the 5-vertex model which, to our knowledge, is a new result. While the 
number of lines has an immediate physical meaning in the context of the sequence matching problem (namely, this 
is just the number of optimal matches between two random sequences), it is difficult to relate the corner density in 
the 2D plane to a directly measurable observable in the sequence matching problem. In an indirect way the corner 
density is related to the degeneracy of the optimal match. However, the corner density has a direct physical meaning 
in terms of the world lines of the totally asymmetric exclusion process (TASEP). The corner density is just the total 
number of particle jumps per unit time and per unit length. 

There are several models and the associated techniques that are directly or indirectly related to the Bernoulli 
Matching model @ . These models include the (1 + 1 )-dimensional directed polymer problem in a random mediurn [25| , 
an anisotropic (2 + l)-dimensional directed percolation model [l^], the longest increasing subsequence problem |35|, a 
(1 + l)-dimensional anisotropic ballistic deposition model [36l | and also the 2-dimensional 5~vertex model described 
in this paper. They all share the common fact that there is a suitable random variable in the model whose limiting 
distribution is described by the Tracy- Widom law first appeared as the limit distribution of the largest eigenvalue 
of a random matrix drawn from the Gaussian unitary ensemble. In FiglSl we provide a flowchart of these different 
models and the connections between them are depicted by numbered arrows which designate the methods of solutions 
of listed problems. Below we briefly comment on these connections. 




Directed polymer 
in random media 




Longest increasing 
subsequence 
and asymmetric 
ballistic deposition 



-<D- 



(Totally) asymmetric 
exclusion process 




Bernoulli matching 
model of longest 
common subsequence 



(2+1 )D anisotropic 
directed percolation 



■<8>- 



5-vertex model 
(Bethe ansatz) 



FIG. 5: A flowchart of the models related to the Bernoulli Matching model of sequence alignment. 



The first comprehensive solution for the distribution of the ground state energy of directed polymer in random media 
(DPRM) has been obtained by Johansson in 25] by mapping this model to the longest increasing (non-decreasing) 
subsequence (LIS) in a random sequence of integers, known also as Ulam problem (sec arrow 1). He also discussed 
the possible connection between the asymmetric exclusion process (ASEP) and LIS but since this relation has not 
been deeply exploited, we have assigned to it a dashed arrow 3. The relation between ASEP and DPRM, arrow 2 is 
well-known [37| |. The works [l^] and [s^ have offered the possibility for direct geometrical connection, shown by the 
arrow 4 between the Tracy- Widom distribution of LIS and the scaled height in the Anisotropic directed percolation 
(ADP) model. The relation between the Bernoulli Matching (BM) model and ADP model (the arrow 5) is given by 
the nonlinear transform established in Q. The solution of asymmetric exclusion process by Bethe ansatz, shown by 
the arrow 7 is a subject of many investigations (see [3^ for references). As it has been mentioned in [2^, the ADP 
model can be solved by mapping it to a 5-vertex model (this link is shown by the arrow 8) providing an alternative 
derivation (using Bethe ansatz) of some results dealing with the statistics of LIS. 
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The shaded cehs connected by the arrow 6 constitute the subject of our current work - the use of the Bethe ansatz 
technique for the 2-dimensional 5-vertex model. This thus adds one more standard technique of statistical physics 
to the number of existing methods that have already been used for investigation of problems depicted in FigO A 
challenging forthcoming goal would be the possibility to extract directly the Tracy- Widom distribution for the scaled 
height in the BM model using the Bethe ansatz method. We note that the appearance of the Tracy- Widom law for 
the distribution of total current in a fragmentation model (which includes the TASEP) was explained before using the 
coordinate Bethe Ansatz method in . This fragmentation model is to the particle hopping model considered in our 
work provided one exchanges the particles and holes. However making a link between the total current distribution 
in the fragmentation model and the height distribution in the BM model is not so straightforward and remains an 
open question. Some recent attempts have been made in this direction in [391] . 

Finally let us note that the BM is a special case of a more general sequence alignment problem, where one includes 
a penalty (or cost) of mismatches Despite the fact that the recursion relation for the cost function can still be 
interpreted as a generalized directed polymer (DP) problem, the terrace-like three-dimensional structure (and hence 
the connection to the vertex models) in no longer valid [40, Sll . There is not clear if one could use the Bethe Ansatz 
for this generalized DP problem. 
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